AITopics | input data

LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Neural Information Processing Systems | Apr-28-2026, 15:28:02 GMT

Larger transformer models perform better on various downstream tasks, but scaling up model size is costly. To enlarge models efficiently, the Mixture-of-Experts (MoE) architecture is widely adopted; it consists of a gate network and a series of experts, and keeps the training cost constant by routing the input data to a fixed number of experts instead of all of them. In existing large-scale MoE training systems, experts are distributed among different GPUs for parallelization, so input data requires additional all-to-all communication to reach the target expert and conduct the corresponding computation. However, upon evaluating the training process of three mainstream MoE models on commonly used GPU clusters, we found that the all-to-all communication ratio averaged around 45%, which significantly hinders the training efficiency and scalability of MoE models. In this paper, we propose LSH-MoE, a communication-efficient MoE training framework using locality-sensitive hashing (LSH). We first present the problems of scaling MoE training in existing systems and highlight the potential of exploiting token similarity to facilitate data compression. Then, we introduce an efficient LSH-based compression technique, which uses cross-polytope hashing for rapid clustering and implements a residual-based error compensation scheme to alleviate the adverse impact of compression. To verify the effectiveness of our methods, we conduct experiments on both language models (e.g., RoBERTa, GPT, and T5) and vision models (e.g., Swin) for both pre-training and fine-tuning tasks. The results demonstrate that our method substantially outperforms its counterparts across different tasks, achieving speedups of 1.28× to 2.2×.
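The compression idea in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`cross_polytope_hash`, `compress`, `decompress`), the single random rotation, and the way residuals are added back after the expert runs are all assumptions made for illustration. Tokens that hash to the same bucket are replaced by their centroid (the only data that would cross the all-to-all link), and each token's residual is kept locally to compensate afterwards.

```python
import numpy as np

def cross_polytope_hash(x, rotation):
    """Bucket id per row: the signed coordinate axis closest to the rotated vector."""
    y = x @ rotation                                 # random rotation of every token
    idx = np.argmax(np.abs(y), axis=1)               # dominant coordinate
    negative = y[np.arange(len(y)), idx] < 0         # sign of that coordinate
    return 2 * idx + negative.astype(np.int64)       # one of 2*d buckets

def compress(tokens, rotation):
    """Cluster tokens by hash bucket; return centroids, bucket index, residuals."""
    buckets = cross_polytope_hash(tokens, rotation)
    ids, inverse = np.unique(buckets, return_inverse=True)
    inverse = inverse.reshape(-1)
    centroids = np.zeros((len(ids), tokens.shape[1]))
    np.add.at(centroids, inverse, tokens)            # sum the tokens in each bucket
    counts = np.bincount(inverse, minlength=len(ids))
    centroids /= counts[:, None]                     # mean of each bucket
    residuals = tokens - centroids[inverse]          # kept locally, not communicated
    return centroids, inverse, residuals

def decompress(expert_out, inverse, residuals):
    """Assumed compensation: broadcast each centroid's output, add the residual."""
    return expert_out[inverse] + residuals

rng = np.random.default_rng(0)
d = 8
tokens = rng.standard_normal((64, d))
rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix

centroids, inverse, residuals = compress(tokens, rotation)
print(f"sent {len(centroids)} centroids instead of {len(tokens)} tokens")
```

With an identity "expert" the round trip reproduces the tokens exactly; with a real expert the residual term only approximates the per-token output, which is the error the compensation scheme is meant to reduce.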

artificial intelligence, machine learning, proceedings, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.75)

metric

Neural Information Processing Systems | Apr-25-2026, 23:44:35 GMT

Dynabench comprises four dynamic tasks with multiple rounds of datasets that will grow over time. Because we have to be able to evaluate a wide variety of models, both in the loop and outside of it, we employ a black-box post hoc approach, i.e., one that can be applied after data collection to existing data, on any uploaded model, without requiring anything other than its predictions. One straightforward way to measure fairness, then, is to apply clearly delimited, heuristic perturbations to existing evaluation datasets and measure whether performance drops. Such an approach is similar to recent works that use grammars to heuristically generate pairs of examples varying in gender [58] and/or race [67], in that both rely on predefined lists of words. However, because we also want to ensure minimal consequences on our classification labels, we adopted an approach that is more targeted than grammars and also preserves the original input data distribution: we replace each word in the input data that has a clear signal about race/ethnicity and/or gender identity with a similar word referring to another group, rerun inference, and measure how many labels flipped (i.e., the difference in microaverage accuracy).
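The perturb-and-rerun procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only: the tiny `SWAPS` word list, the `perturb` and `flip_rate` helpers, and the toy `predict` function are hypothetical and are not Dynabench's actual lists or API. The model is treated as a black box that maps text to a label.

```python
import re

# Hypothetical, deliberately tiny swap list; a real list would be far larger.
SWAPS = {"he": "she", "she": "he", "man": "woman", "woman": "man"}

def perturb(text, swaps=SWAPS):
    """Replace each listed identity word with its counterpart, keeping capitalization."""
    def repl(match):
        word = match.group(0)
        new = swaps.get(word.lower())
        if new is None:
            return word                      # word carries no listed signal
        return new.capitalize() if word[0].isupper() else new
    return re.sub(r"[A-Za-z]+", repl, text)

def flip_rate(predict, texts):
    """Fraction of perturbed examples whose predicted label changes."""
    changed = [t for t in texts if perturb(t) != t]
    if not changed:
        return 0.0
    flips = sum(predict(t) != predict(perturb(t)) for t in changed)
    return flips / len(changed)

# Toy black-box model that (unfairly) keys on one pronoun.
def predict(text):
    return "pos" if "she" in text.lower() else "neg"

texts = ["He left.", "The sky is blue.", "She stayed."]
print(perturb("He is a man."))
print(flip_rate(predict, texts))
```

Only examples that actually change under perturbation are counted, so unaffected inputs such as "The sky is blue." do not dilute the rate.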

artificial intelligence, machine learning, natural language, (18 more...)

Industry: Transportation (0.37)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.98)
Information Technology > Artificial Intelligence > Natural Language (0.73)

mhealth_ood_neurips_2021.pdf

Neural Information Processing Systems | Apr-24-2026, 21:49:17 GMT

artificial intelligence, machine learning, tnr, (17 more...)

Industry: Health & Medicine > Therapeutic Area (0.38)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

17e23e50bedc63b4095e3d8204ce063b-Paper.pdf

Neural Information Processing Systems | Apr-24-2026, 21:49:13 GMT

artificial intelligence, data mining, machine learning, (18 more...)

Genre: Research Report > New Finding (0.94)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Consumer Health (0.93)
(4 more...)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing (0.93)
(3 more...)

029f82afd78288059dc946b105c451fd-Paper-Conference.pdf

Neural Information Processing Systems | Apr-24-2026, 08:14:41 GMT

artificial intelligence, data mining, machine learning, (19 more...)

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Databases (0.71)
Information Technology > Data Science > Data Mining (0.67)

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization

Neural Information Processing Systems | Apr-24-2026, 06:24:20 GMT

Despite extensive studies, the underlying reason why overparameterized neural networks can generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, and thus a natural potential explanation is that flatness implies generalization. This work critically examines this explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1) flatness provably implies generalization; (2) there exist non-generalizing flattest models and sharpness minimization algorithms fail to generalize; and (3) perhaps most strikingly, there exist non-generalizing flattest models, but sharpness minimization algorithms still generalize. Our results suggest that the relationship between sharpness and generalization subtly depends on the data distributions and the model architectures, and that sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. This calls for the search for other explanations for the generalization of overparameterized neural networks.

artificial intelligence, generalization, machine learning, (14 more...)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
